# check if you have ISLR package, if not, install it
library(ISLR)
library(data.table)
library(skimr)
library(ggrepel)
library(patchwork)
library(latex2exp)
library(factoextra)
library(ggbiplot)
library(gt)
library(here)
library(tidyverse)
rm(list=ls())
i_am('hw2_sp2022.Rmd')
Principal Component Analysis (PCA) is widely used for data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations while retaining as much of the information in the data as possible. High-dimensional data often reveals clusters when viewed in a lower-dimensional space.
Clustering analysis is another form of EDA. Here we hope to group data points that are close to each other within groups and far apart between groups. Clustering on PCs can be effective. Clustering analysis can be quite subjective in how we summarize the properties within each group.
Both PCA and clustering analysis are so-called unsupervised learning: no response variable is involved in the process.
For supervised learning, we try to find out how a set of predictors relates to some response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use a linear model as a working model for its simplicity and interpretability. On the other hand, it is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors.
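The contrast between the two settings can be sketched on simulated data (all variables below are illustrative, not from the NLSY data):

```r
# Minimal sketch on simulated data: PCA (unsupervised) vs. regression (supervised)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)   # two correlated predictors
y  <- 1 + 2 * x1 - x2 + rnorm(n)      # a response, used only in the supervised step

# Unsupervised: PCA builds uncorrelated linear combinations of x1 and x2
pc <- prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)
summary(pc)                            # PC1 captures most of the shared variation

# Supervised: a linear working model relating the predictors to the response
fit <- lm(y ~ x1 + x2)
coef(fit)
```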
Data files used in this assignment: NLSY79.csv, brca_subtype.csv, brca_x_patient.csv.
Self-esteem generally describes a person's overall sense of self-worth and personal value. It can play a significant role in one's motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, life experiences, etc. We will try to identify factors in our data that are related to the level of self-esteem.
The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and numerous individual-year variables have been gathered through surveys. The survey data is open to the public. Among the many variables, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.
The data is stored in NLSY79.csv.
Here are the descriptions of the variables:
Personal Demographic Variables
Household Environment
Variables Related to ASVAB test Scores in 1981
| Test | Description |
|---|---|
| AFQT | percentile score on the AFQT intelligence test in 1981 |
| Coding | score on the Coding Speed test in 1981 |
| Auto | score on the Automotive and Shop test in 1981 |
| Mechanic | score on the Mechanic test in 1981 |
| Elec | score on the Electronics Information test in 1981 |
| Science | score on the General Science test in 1981 |
| Math | score on the Math test in 1981 |
| Arith | score on the Arithmetic Reasoning test in 1981 |
| Word | score on the Word Knowledge Test in 1981 |
| Parag | score on the Paragraph Comprehension test in 1981 |
| Numer | score on the Numerical Operations test in 1981 |
Self-Esteem tests in 1981 and 1987
We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is Esteem question 1 in 1981.
The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree
Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?
esteem <- read_csv(here('data','NLSY79.csv'))
skim(esteem)
| Name | esteem |
| Number of rows | 2431 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 44 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Gender | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| Job05 | 56 | 0.98 | 16 | 83 | 0 | 30 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Subject | 0 | 1 | 3504.50 | 2764.82 | 2 | 1592.5 | 3137 | 4668.5 | 12140 | ▇▇▃▁▂ |
| Education05 | 0 | 1 | 13.91 | 2.50 | 6 | 12.0 | 13 | 16.0 | 20 | ▁▁▇▃▂ |
| Income87 | 0 | 1 | 13398.80 | 11599.62 | -2 | 4500.0 | 12000 | 19000.0 | 59387 | ▇▆▂▁▁ |
| Income05 | 0 | 1 | 49414.80 | 46287.60 | 63 | 22650.0 | 38500 | 61350.0 | 703637 | ▇▁▁▁▁ |
| Weight05 | 0 | 1 | 183.10 | 41.71 | 81 | 150.0 | 180 | 209.0 | 380 | ▂▇▃▁▁ |
| HeightFeet05 | 0 | 1 | 5.18 | 0.57 | -4 | 5.0 | 5 | 5.0 | 8 | ▁▁▁▇▂ |
| HeightInch05 | 0 | 1 | 5.32 | 3.51 | 0 | 2.0 | 5 | 8.0 | 11 | ▇▅▅▅▇ |
| Imagazine | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Inewspaper | 0 | 1 | 0.86 | 0.35 | 0 | 1.0 | 1 | 1.0 | 1 | ▁▁▁▁▇ |
| Ilibrary | 0 | 1 | 0.77 | 0.42 | 0 | 1.0 | 1 | 1.0 | 1 | ▂▁▁▁▇ |
| MotherEd | 0 | 1 | 11.70 | 2.61 | 0 | 11.0 | 12 | 12.0 | 20 | ▁▁▇▂▁ |
| FatherEd | 0 | 1 | 11.85 | 3.56 | 0 | 10.0 | 12 | 14.0 | 20 | ▁▂▇▃▁ |
| FamilyIncome78 | 0 | 1 | 21251.67 | 13878.56 | 0 | 11167.0 | 20000 | 27500.0 | 75001 | ▇▇▂▁▁ |
| Science | 0 | 1 | 16.29 | 4.77 | 0 | 13.0 | 17 | 20.0 | 25 | ▁▂▆▇▅ |
| Arith | 0 | 1 | 18.59 | 7.15 | 0 | 13.0 | 19 | 25.0 | 30 | ▁▆▇▇▇ |
| Word | 0 | 1 | 26.61 | 7.02 | 0 | 23.0 | 28 | 32.0 | 35 | ▁▁▂▅▇ |
| Parag | 0 | 1 | 11.23 | 3.15 | 0 | 10.0 | 12 | 14.0 | 15 | ▁▂▂▆▇ |
| Number | 0 | 1 | 35.46 | 10.11 | 0 | 29.0 | 36 | 44.0 | 50 | ▁▂▅▇▇ |
| Coding | 0 | 1 | 47.05 | 15.24 | 0 | 38.0 | 48 | 57.0 | 84 | ▁▃▇▇▂ |
| Auto | 0 | 1 | 14.26 | 5.32 | 0 | 10.0 | 14 | 18.0 | 25 | ▁▆▇▆▅ |
| Math | 0 | 1 | 14.26 | 6.29 | 0 | 9.0 | 14 | 20.0 | 25 | ▂▇▇▆▇ |
| Mechanic | 0 | 1 | 14.43 | 5.10 | 0 | 11.0 | 14 | 18.0 | 25 | ▁▅▇▆▃ |
| Elec | 0 | 1 | 11.59 | 4.09 | 0 | 9.0 | 12 | 15.0 | 20 | ▁▅▇▇▃ |
| AFQT | 0 | 1 | 54.68 | 27.70 | 0 | 31.9 | 57 | 78.2 | 100 | ▅▆▇▇▇ |
| Esteem81_1 | 0 | 1 | 1.42 | 0.52 | 1 | 1.0 | 1 | 2.0 | 4 | ▇▆▁▁▁ |
| Esteem81_2 | 0 | 1 | 1.42 | 0.51 | 1 | 1.0 | 1 | 2.0 | 4 | ▇▆▁▁▁ |
| Esteem81_3 | 0 | 1 | 3.51 | 0.57 | 1 | 3.0 | 4 | 4.0 | 4 | ▁▁▁▆▇ |
| Esteem81_4 | 0 | 1 | 1.57 | 0.56 | 1 | 1.0 | 2 | 2.0 | 4 | ▇▇▁▁▁ |
| Esteem81_5 | 0 | 1 | 3.46 | 0.65 | 1 | 3.0 | 4 | 4.0 | 4 | ▁▁▁▆▇ |
| Esteem81_6 | 0 | 1 | 1.62 | 0.56 | 1 | 1.0 | 2 | 2.0 | 4 | ▆▇▁▁▁ |
| Esteem81_7 | 0 | 1 | 1.75 | 0.60 | 1 | 1.0 | 2 | 2.0 | 4 | ▅▇▁▁▁ |
| Esteem81_8 | 0 | 1 | 3.13 | 0.76 | 1 | 3.0 | 3 | 4.0 | 4 | ▁▂▁▇▆ |
| Esteem81_9 | 0 | 1 | 3.16 | 0.72 | 1 | 3.0 | 3 | 4.0 | 4 | ▁▂▁▇▆ |
| Esteem81_10 | 0 | 1 | 3.40 | 0.66 | 1 | 3.0 | 3 | 4.0 | 4 | ▁▁▁▇▇ |
| Esteem87_1 | 0 | 1 | 1.38 | 0.50 | 1 | 1.0 | 1 | 2.0 | 4 | ▇▅▁▁▁ |
| Esteem87_2 | 0 | 1 | 1.40 | 0.50 | 1 | 1.0 | 1 | 2.0 | 4 | ▇▅▁▁▁ |
| Esteem87_3 | 0 | 1 | 3.58 | 0.54 | 1 | 3.0 | 4 | 4.0 | 4 | ▁▁▁▅▇ |
| Esteem87_4 | 0 | 1 | 1.50 | 0.53 | 1 | 1.0 | 1 | 2.0 | 4 | ▇▇▁▁▁ |
| Esteem87_5 | 0 | 1 | 3.53 | 0.60 | 1 | 3.0 | 4 | 4.0 | 4 | ▁▁▁▆▇ |
| Esteem87_6 | 0 | 1 | 1.59 | 0.56 | 1 | 1.0 | 2 | 2.0 | 4 | ▆▇▁▁▁ |
| Esteem87_7 | 0 | 1 | 1.72 | 0.58 | 1 | 1.0 | 2 | 2.0 | 4 | ▅▇▁▁▁ |
| Esteem87_8 | 0 | 1 | 3.10 | 0.74 | 1 | 3.0 | 3 | 4.0 | 4 | ▁▃▁▇▅ |
| Esteem87_9 | 0 | 1 | 3.06 | 0.74 | 1 | 3.0 | 3 | 4.0 | 4 | ▁▃▁▇▅ |
| Esteem87_10 | 0 | 1 | 3.37 | 0.65 | 1 | 3.0 | 3 | 4.0 | 4 | ▁▁▁▇▇ |
#names(esteem)
# temp <- read.csv('data/NLSY79.csv', header = T, stringsAsFactors = F)
# # missing values? real variables vs. factors? are variable values reasonable?
#str(esteem)
#summary(esteem)
#levels(as.factor(esteem$Job05))
#table(as.factor(esteem$Job05))
The dataset has 2431 subjects/people and 46 variables. The variables are of two types: numeric and character. In some cases, variables that ought to be coded as categorical are coded as numeric. For example, Imagazine, Inewspaper, and Ilibrary have yes/no values. Moreover, the esteem variables are ordinal, with categories strongly agree, agree, disagree, and strongly disagree.
Further, we see that Job05 has 56 missing values.
Self-esteem evaluation
Let's concentrate on the Esteem scores evaluated in 1987, i.e., the Esteem87 variables. Pay attention to missing values, any peculiar numbers, etc. How do you fix any problems discovered? Briefly describe what you have done for the data preparation.
esteem %>%
  skim(Esteem87_1:Esteem87_10)
| Name | Piped data |
| Number of rows | 2431 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Esteem87_1 | 0 | 1 | 1.38 | 0.50 | 1 | 1 | 1 | 2 | 4 | ▇▅▁▁▁ |
| Esteem87_2 | 0 | 1 | 1.40 | 0.50 | 1 | 1 | 1 | 2 | 4 | ▇▅▁▁▁ |
| Esteem87_3 | 0 | 1 | 3.58 | 0.54 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▅▇ |
| Esteem87_4 | 0 | 1 | 1.50 | 0.53 | 1 | 1 | 1 | 2 | 4 | ▇▇▁▁▁ |
| Esteem87_5 | 0 | 1 | 3.53 | 0.60 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▆▇ |
| Esteem87_6 | 0 | 1 | 1.59 | 0.56 | 1 | 1 | 2 | 2 | 4 | ▆▇▁▁▁ |
| Esteem87_7 | 0 | 1 | 1.72 | 0.58 | 1 | 1 | 2 | 2 | 4 | ▅▇▁▁▁ |
| Esteem87_8 | 0 | 1 | 3.10 | 0.74 | 1 | 3 | 3 | 4 | 4 | ▁▃▁▇▅ |
| Esteem87_9 | 0 | 1 | 3.06 | 0.74 | 1 | 3 | 3 | 4 | 4 | ▁▃▁▇▅ |
| Esteem87_10 | 0 | 1 | 3.37 | 0.65 | 1 | 3 | 3 | 4 | 4 | ▁▁▁▇▇ |
I use skim to summarize all the Esteem variables from the 1987 round. There are no missing values in these variables.
Looking at the means of these variables, we see that some are low while others are higher. For instance, Esteem87_1, Esteem87_2, Esteem87_4, Esteem87_6, and Esteem87_7 have low means, whereas the rest have higher means. To understand why this pattern is present, I review the variable descriptions provided above. There is a fundamental difference between the two groups of questions: the esteem questions with low means are associated with positive qualities and are therefore likely to be answered in agreeable terms (i.e., lower numbers, which indicate agreement or strong agreement), whereas those with higher means are associated with negative qualities and are more likely to be answered in terms of disagreement (i.e., higher numbers). Consequently, the two groups of items are scored in opposite directions. One way to fix this is to reverse the scores of one set of variables so that all variables point in the same direction with respect to self-esteem. The next question asks us to make these changes.
(If you name the data frame data.esteem, then run
data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)]
to reverse the score.)
To reverse the esteem scores, I use the following code:
#esteem[, c(1, 2, 4, 6, 7)] <- 5 - esteem[, c(1, 2, 4, 6, 7)]
esteem.rev <- esteem %>% mutate(
Esteem87_1 = 5 - Esteem87_1,
Esteem87_2 = 5 - Esteem87_2,
Esteem87_4 = 5 - Esteem87_4,
Esteem87_6 = 5 - Esteem87_6,
Esteem87_7 = 5 - Esteem87_7)
esteem.rev %>%
  skim(Esteem87_1:Esteem87_10)
| Name | Piped data |
| Number of rows | 2431 |
| Number of columns | 46 |
| _______________________ | |
| Column type frequency: | |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Esteem87_1 | 0 | 1 | 3.62 | 0.50 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▅▇ |
| Esteem87_2 | 0 | 1 | 3.60 | 0.50 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▅▇ |
| Esteem87_3 | 0 | 1 | 3.58 | 0.54 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▅▇ |
| Esteem87_4 | 0 | 1 | 3.50 | 0.53 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▇▇ |
| Esteem87_5 | 0 | 1 | 3.53 | 0.60 | 1 | 3 | 4 | 4 | 4 | ▁▁▁▆▇ |
| Esteem87_6 | 0 | 1 | 3.41 | 0.56 | 1 | 3 | 3 | 4 | 4 | ▁▁▁▇▆ |
| Esteem87_7 | 0 | 1 | 3.28 | 0.58 | 1 | 3 | 3 | 4 | 4 | ▁▁▁▇▅ |
| Esteem87_8 | 0 | 1 | 3.10 | 0.74 | 1 | 3 | 3 | 4 | 4 | ▁▃▁▇▅ |
| Esteem87_9 | 0 | 1 | 3.06 | 0.74 | 1 | 3 | 3 | 4 | 4 | ▁▃▁▇▅ |
| Esteem87_10 | 0 | 1 | 3.37 | 0.65 | 1 | 3 | 3 | 4 | 4 | ▁▁▁▇▇ |
Note that the reversal creates two categories of variables. The first category consists of variables associated with positive qualities; the reversal changes the response scale to 1: strongly disagree, 2: disagree, 3: agree, and 4: strongly agree. The re-ordering assigns a higher number to a higher evaluation of self-esteem.
The second category of variables, associated with negative qualities, keeps the existing order, since these already assign a higher number to a higher self-esteem evaluation.
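As a side note, the five reversals above can also be written in one step with dplyr::across(); a minimal sketch on toy data (the tibble below is illustrative, not the NLSY data):

```r
library(dplyr)

# Items worded positively, to be reverse-coded (the same five as above)
rev_items <- c("Esteem87_1", "Esteem87_2", "Esteem87_4",
               "Esteem87_6", "Esteem87_7")

# Toy data standing in for the NLSY file
demo <- tibble(Esteem87_1 = c(1, 2, 4), Esteem87_2 = c(2, 1, 3),
               Esteem87_4 = c(1, 1, 2), Esteem87_6 = c(2, 2, 1),
               Esteem87_7 = c(1, 3, 2))

# One mutate() call instead of five: 5 - x flips the 1..4 scale
demo_rev <- demo %>% mutate(across(all_of(rev_items), ~ 5 - .x))
demo_rev$Esteem87_1   # 4 3 1
```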
I present two separate summaries of the variable categories discussed above. From the graphs, we can see that most respondents report high self-esteem on both categories. There are some exceptions, though: for instance, a substantial fraction of respondents wished they could have more self-respect (Esteem87_8), while several others reported that they feel useless at times (Esteem87_9).
# First I summarize Esteem87_1, Esteem87_2, Esteem87_4, Esteem87_6, and Esteem87_7.
esteem.rev %>%
select(Esteem87_1, Esteem87_2, Esteem87_4,
Esteem87_6:Esteem87_7)%>%
pivot_longer(
cols = c(Esteem87_1, Esteem87_2, Esteem87_4,
Esteem87_6:Esteem87_7),
names_to = 'variable',values_to = 'value') %>%
mutate(variable = str_to_title(variable)) %>%
ggplot() +
geom_bar(mapping=aes(y=value)) +
labs(y='',
x='Count') +
facet_wrap(~variable, scales = 'free',nrow = 3) +
  theme_minimal()
# Next I summarize Esteem87_3, Esteem87_5, and Esteem87_8 to Esteem87_10.
esteem.rev %>%
select(Esteem87_3,Esteem87_5, Esteem87_8:Esteem87_10)%>%
pivot_longer(
cols = c(Esteem87_3:Esteem87_5, Esteem87_8:Esteem87_10),
names_to = 'variable',values_to = 'value') %>%
mutate(variable = str_to_title(variable)) %>%
ggplot() +
geom_bar(mapping=aes(y=value)) +
labs(y='',
x='Count') +
facet_wrap(~variable, scales = 'free',nrow = 3) +
  theme_minimal()
Looking at the correlation table, the first thing to note is that all esteem variables are positively correlated. Furthermore, respondents' self-worth is fairly strongly correlated with feeling that they have a number of good qualities, and taking "a positive attitude towards myself" is fairly strongly correlated with personal satisfaction.
# correlation for all variables
esteem.pca <- esteem.rev %>%
select(Esteem87_1:Esteem87_10)
corr <-round(cor(esteem.pca),
digits = 2 # rounded to 2 decimals
)
corr
##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1 1.00 0.70 0.45 0.53 0.40 0.46
## Esteem87_2 0.70 1.00 0.44 0.55 0.40 0.48
## Esteem87_3 0.45 0.44 1.00 0.41 0.55 0.41
## Esteem87_4 0.53 0.55 0.41 1.00 0.38 0.51
## Esteem87_5 0.40 0.40 0.55 0.38 1.00 0.40
## Esteem87_6 0.46 0.48 0.41 0.51 0.40 1.00
## Esteem87_7 0.38 0.41 0.34 0.42 0.37 0.60
## Esteem87_8 0.27 0.28 0.35 0.30 0.38 0.41
## Esteem87_9 0.24 0.26 0.35 0.29 0.35 0.36
## Esteem87_10 0.31 0.33 0.46 0.37 0.44 0.44
## Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1 0.38 0.27 0.24 0.31
## Esteem87_2 0.41 0.28 0.26 0.33
## Esteem87_3 0.34 0.35 0.35 0.46
## Esteem87_4 0.42 0.30 0.29 0.37
## Esteem87_5 0.37 0.38 0.35 0.44
## Esteem87_6 0.60 0.41 0.36 0.44
## Esteem87_7 1.00 0.39 0.35 0.39
## Esteem87_8 0.39 1.00 0.43 0.44
## Esteem87_9 0.35 0.43 1.00 0.58
## Esteem87_10 0.39 0.44 0.58 1.00
After these preliminary checks, we now perform a PCA on the Esteem87 variables. Since these items share the same 1-to-4 scale, there is no need to scale them; however, we center each variable.
esteem.pca <- esteem.rev %>% select(Esteem87_1:Esteem87_10)
pca_esteem87 <- prcomp(esteem.pca, center=T, scale=F) # by default, center=True but scale=FALSE!!!
names(pca_esteem87) # check output
## [1] "sdev"     "rotation" "center"   "scale"    "x"
# rotation: loadings
# x: PC scores
# sdev: standard deviation of the PC scores
The table below reports the loadings of all the principal components.
pca_esteem87.loading <- pca_esteem87$rotation
knitr::kable(pca_esteem87.loading)
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Esteem87_1 | 0.235 | -0.374 | 0.057 | -0.012 | 0.393 | -0.010 | 0.201 | -0.352 | -0.066 | -0.691 |
| Esteem87_2 | 0.244 | -0.367 | 0.062 | 0.028 | 0.377 | -0.009 | 0.151 | -0.342 | -0.006 | 0.720 |
| Esteem87_3 | 0.279 | -0.149 | 0.121 | -0.436 | -0.078 | -0.012 | 0.595 | 0.568 | 0.105 | 0.023 |
| Esteem87_4 | 0.261 | -0.321 | 0.067 | 0.135 | 0.264 | -0.124 | -0.608 | 0.488 | 0.334 | -0.040 |
| Esteem87_5 | 0.312 | -0.131 | 0.061 | -0.592 | -0.361 | 0.404 | -0.407 | -0.257 | -0.074 | 0.002 |
| Esteem87_6 | 0.313 | -0.209 | -0.071 | 0.357 | -0.248 | -0.032 | -0.034 | 0.225 | -0.782 | 0.016 |
| Esteem87_7 | 0.299 | -0.163 | -0.122 | 0.497 | -0.512 | 0.205 | 0.203 | -0.153 | 0.502 | -0.034 |
| Esteem87_8 | 0.393 | 0.332 | -0.821 | -0.116 | 0.211 | -0.047 | -0.003 | 0.006 | 0.026 | 0.003 |
| Esteem87_9 | 0.398 | 0.578 | 0.442 | 0.211 | 0.282 | 0.428 | 0.023 | 0.054 | -0.032 | -0.008 |
| Esteem87_10 | 0.376 | 0.260 | 0.284 | -0.084 | -0.222 | -0.770 | -0.060 | -0.234 | 0.054 | -0.007 |
Below are the loadings of PC1 and PC2; further down we verify that the PC scores are uncorrelated.
# Loadings
pca_loadings <- pca_esteem87$rotation
# Prepare table with loadings for PC1 and PC2
tab_loadings <- cbind( c( 1:10 ), pca_loadings[ , 1:2 ] )
knitr::kable( tab_loadings,
col.names = c( 'Esteem Type', 'PC1', 'PC2' ),
caption = 'PC1 and PC2 loadings' ) %>%
  kableExtra::kable_styling()
| | Esteem Type | PC1 | PC2 |
|---|---|---|---|
| Esteem87_1 | 1 | 0.235 | -0.374 |
| Esteem87_2 | 2 | 0.244 | -0.367 |
| Esteem87_3 | 3 | 0.279 | -0.149 |
| Esteem87_4 | 4 | 0.261 | -0.321 |
| Esteem87_5 | 5 | 0.312 | -0.131 |
| Esteem87_6 | 6 | 0.313 | -0.209 |
| Esteem87_7 | 7 | 0.299 | -0.163 |
| Esteem87_8 | 8 | 0.393 | 0.332 |
| Esteem87_9 | 9 | 0.398 | 0.578 |
| Esteem87_10 | 10 | 0.376 | 0.260 |
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)
From the loading values in the table, we can say that PC1 represents an (approximately equally weighted) average of all self-esteem scores, while PC2 represents the contrast between the items related to self-respect and usefulness (Esteem questions 8, 9, and 10) and those related to self-perception (questions 1-7).
c) How is the PC1 score obtained for each subject? Write down the formula.
We can summarize the PC1 score of a Subject \(n\) with the following formula: \[PC1_n = 0.235*E1_n+0.244*E2_n+0.279*E3_n+0.261*E4_n+0.312*E5_n+0.313*E6_n+0.299*E7_n+0.393*E8_n+0.398*E9_n+0.376*E10_n.\]
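This formula can be sanity-checked: prcomp() scores are exactly the centered data multiplied by the loadings. A minimal sketch on simulated data (not the esteem data):

```r
# Sketch: PC scores = centered data %*% loadings, shown for the first column
set.seed(2)
X  <- matrix(rnorm(50 * 4), nrow = 50)   # 50 "subjects", 4 "items"
pc <- prcomp(X, center = TRUE, scale. = FALSE)

manual_pc1 <- scale(X, center = TRUE, scale = FALSE) %*% pc$rotation[, 1]
all.equal(as.numeric(manual_pc1), as.numeric(pc$x[, 1]))   # TRUE
```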
d) Are PC1 scores and PC2 scores in the data uncorrelated?
The table below shows that PC1 and PC2 are uncorrelated:
# Check that PCs are uncorrelated
cor(pca_esteem87$x) %>% round(0)
##      PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## PC1 1 0 0 0 0 0 0 0 0 0
## PC2 0 1 0 0 0 0 0 0 0 0
## PC3 0 0 1 0 0 0 0 0 0 0
## PC4 0 0 0 1 0 0 0 0 0 0
## PC5 0 0 0 0 1 0 0 0 0 0
## PC6 0 0 0 0 0 1 0 0 0 0
## PC7 0 0 0 0 0 0 1 0 0 0
## PC8 0 0 0 0 0 0 0 1 0 0
## PC9 0 0 0 0 0 0 0 0 1 0
## PC10 0 0 0 0 0 0 0 0 0 1
e) Plot PVE (Proportion of Variance Explained) and summarize the plot.
Now, we check the proportion of variance explained by each PC using the Scree plot of PVE and the Cumulative Scree Plot of PVE below. We can see that the first two PC scores explain about 60% of the variance in the data.
summary(pca_esteem87)$importance
##                          PC1   PC2    PC3   PC4    PC5    PC6   PC7    PC8
## Standard deviation 1.297 0.678 0.5717 0.520 0.4610 0.4330 0.375 0.3668
## Proportion of Variance 0.466 0.127 0.0905 0.075 0.0588 0.0519 0.039 0.0373
## Cumulative Proportion 0.466 0.593 0.6837 0.759 0.8175 0.8694 0.908 0.9457
## PC9 PC10
## Standard deviation 0.3499 0.2713
## Proportion of Variance 0.0339 0.0204
## Cumulative Proportion 0.9796 1.0000
plot(pca_esteem87)
plot(summary(pca_esteem87)$importance[2, ], # PVE
     ylab="PVE",
     xlab="Number of PCs",
     pch = 16,
     main="Scree Plot of PVE for Esteem87")
f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
plot(summary(pca_esteem87)$importance[3, ], # CPVE
     ylab="Cumulative PVE",
     xlab="Number of PCs",
     pch = 16,
     main="Cumulative Scree Plot of PVE for Esteem87")
g) PC's provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data. Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)
The biplot below displays PC1 vs PC2 for each subject, along with the loadings. We can verify that PC2 has different signs for Esteem questions 1-7 and 8-10, whereas PC1 has the same sign for all items.
## biplot pcs
ggbiplot( pca_esteem87,
obs.scale = 1,
var.scale = 1, alpha = 0.3 ) +
geom_hline( yintercept = 0, size = 0.5, color = 'blue', linetype = 'dotted' ) +
geom_vline( xintercept = 0, size = 0.5, color = 'blue', linetype = 'dotted' ) +
scale_color_discrete(name = '') +
  labs( title = 'PC1 vs PC2 by Subjects and Loadings of Esteem Questions' )
After the Principal Component Analysis, we perform a clustering analysis on our data. We first assess what an optimal number of clusters would be. Using the elbow criterion, we cluster our data into two groups.
a) Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.
esteem_org <- esteem %>% select(Esteem87_1:Esteem87_10)
set.seed(0)
# function to compute total within-cluster sum of square
wss <- function(df, k) {
kmeans(df, k, nstart = 10)$tot.withinss
}
# Compute and plot wss for k = 2 to k = 20
k.values <- 2:20
# extract wss for 2-20 clusters using sapply
wss_values <- sapply(k.values,
function(k) kmeans(esteem_org[,-1], centers = k)$tot.withinss)
# or use map_dbl()
#wss_values <- map_dbl(k.values, function(k) wss(esteem_org[,-1], k))
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
     ylab="Total within-cluster sum of squares")
#set.seed(0)
fviz_nbclust(esteem_org[,-1], kmeans, method = "wss")
b) Can you summarize common features within each cluster?
# two clusters
clus_esteem87 <- kmeans(esteem_org, centers = 2 )
# get clusters to original data
esteem$clus <- clus_esteem87$cluster
# get pcs to original data
esteem$pc1_esteem87 <- pca_esteem87$x[ , 1]
esteem$pc2_esteem87 <- pca_esteem87$x[ , 2]
cluster_pcplot <-
ggplot( esteem ) +
geom_hline( yintercept = 0, color = 'blue', size = 0.5, linetype = 'dotted' ) +
geom_vline( xintercept = 0, color = 'blue', size = 0.5, linetype = 'dotted' ) +
geom_point( aes( x = pc1_esteem87, y = pc2_esteem87,
color = factor( clus ),
shape = factor( clus ) ) ) +
labs( title = 'PC1 vs PC2 by cluster', x = 'PC1', y = 'PC2' ) +
scale_color_manual( name = '',
values = c( 'black', 'red' ),
labels = c( 'Cluster 1', 'Cluster 2' ) ) +
scale_shape_manual( name = '',
values = c( 1, 19 ),
labels = c( 'Cluster 1', 'Cluster 2' ) ) +
scale_x_continuous( breaks = seq( -5, 5, 1 ), limits = c( -4, 4 ) ) +
scale_y_continuous( breaks = seq( -4, 4, 1 ), limits = c( -4, 4 ) )
print( cluster_pcplot )
c) Can you visualize the clusters with somewhat clear boundaries? You
may try different pairs of variables and different PC pairs of the
esteem scores.
The plot of PC1 vs PC2 above shows a clear division between Cluster 1 and Cluster 2 in terms of their PC scores. The first PC divides the sample into two groups: one (red) with PC1 greater than 0, and another with PC1 smaller than 0. Therefore, we may say that Cluster 2 is composed of subjects who reported higher self-esteem scores.
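Beyond the PC plot, a concrete way to summarize common features within each cluster is to inspect the per-item means, i.e. the k-means centers. A minimal sketch on simulated data (the group means, item names, and cluster count below are illustrative, not from the NLSY data):

```r
# Sketch: describe clusters via per-variable means (the k-means centers)
set.seed(3)
grp1 <- matrix(rnorm(100 * 3, mean = 3.5, sd = 0.3), ncol = 3)  # "high-esteem" group
grp2 <- matrix(rnorm(100 * 3, mean = 2.0, sd = 0.3), ncol = 3)  # "low-esteem" group
dat  <- rbind(grp1, grp2)
colnames(dat) <- paste0("item", 1:3)

km <- kmeans(dat, centers = 2, nstart = 10)
round(km$centers, 2)   # one row of item means per cluster
table(km$cluster)      # cluster sizes
```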
We now try to find out which factors are related to self-esteem. PC1 of all the Esteem scores is a good variable to summarize one's esteem scores, so we take PC1 as our response variable.
EDA the data set first.
Personal information: gender, education (05), log(income) in 87, job type in 87, and Weight05 (lb) together with HeightFeet05 and HeightInch05. One way to summarize weight and height is the Body Mass Index (BMI), defined as body mass divided by the square of body height, universally expressed in units of kg/m². Note that you need to create BMI first; then you may include it as one possible predictor.
Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd,
FatherEd, FamilyIncome78. Do set indicators Imagazine,
Inewspaper and Ilibrary as factors.
You may use PC1 of ASVAB as level of intelligence
In order to examine associations between overall self-esteem (summarized by the PC1 score) and other relevant variables, we begin the EDA by grouping variables into two groups based on the personal and family characteristics of the respondents. Next, we check some basic associations between self-esteem and these variables using scatter plots, histograms or boxplots.
# PCA asvab
asvab <- esteem %>% select( Science, Arith, Word, Parag, Number, Coding, Auto, Math, Mechanic, Elec)
pca_asvab <- prcomp(asvab,
center = TRUE,
scale = TRUE)
esteem$pc1_asvab <- pca_asvab$x[ , 1 ]
#create BMI
esteem <- esteem %>% mutate (
BMI = Weight05*0.454 / ( HeightFeet05*0.3048 + HeightInch05*0.0254)^2,
Male = factor( if_else( Gender == 'male', 1, 0 ) ),
Esteem = pc1_esteem87,
ASVAB = pc1_asvab,
Imagazine = factor(Imagazine),
Inewspaper = factor(Inewspaper),
Ilibrary = factor(Ilibrary),
MotherEd, FatherEd,
logFamIncome = log(FamilyIncome78)
)
# Personal characteristics
ggIncome <-
ggplot( esteem, aes( x = log(Income87), y = Esteem ) ) +
geom_point( ) +
geom_smooth( method = 'lm', se = F ) +
scale_x_continuous( breaks = seq( -10, 15, 1 ),
                      name = 'log(Income in 1987)' ) +
scale_y_continuous( breaks = seq( -5, 5, 1 ) ) +
labs( title = 'Income vs Esteem PC1' )
ggEduc <-
ggplot( esteem, aes( x = Education05, y = Esteem ) ) +
geom_point( ) +
geom_smooth( method = 'lm', se = F ) +
scale_x_continuous( breaks = seq( 0, 20, 2 ),
name = 'Years of schooling completed (2005)' ) +
labs( title = 'Education vs Esteem PC1' ) +
scale_y_continuous( breaks = seq( -5, 5, 1 ) )
ggIntel <-
ggplot( esteem, aes( x = ASVAB, y = Esteem ) ) +
geom_point( ) +
geom_smooth( method = 'lm', se = F ) +
scale_x_continuous( breaks = seq( -10, 10, 2 ),
name = 'ASVAB score (PC1)' ) +
labs( title = 'ASVAB score (PC1) vs Esteem PC1' ) +
scale_y_continuous( breaks = seq( -5, 5, 1 ) )
ggBMI <-
ggplot(data = esteem %>% filter( BMI < 50 ),aes( x = BMI, y = Esteem ) ) +
geom_point( ) +
geom_smooth( method = 'lm', se = F ) +
scale_x_continuous( breaks = seq( 10, 200, 5 ),
name = 'BMI (kg/m^2)' ) +
labs( title = 'BMI vs Esteem PC1' ) +
scale_y_continuous( breaks = seq( -5, 5, 1 ) )
ggGender <-
ggplot(esteem) +
geom_boxplot( aes( x = factor( Male ), y = Esteem ) ) +
scale_x_discrete( labels = c( 'Female', 'Male' ), name = '' ) +
labs( title = 'Gender vs Esteem PC1' )
print(ggIncome)
print(ggEduc)
print(ggIntel)
print(ggBMI)
print(ggGender)
ggMagaz <-
ggplot(esteem) +
geom_boxplot( aes( x = factor( Imagazine ), y = Esteem ) ) +
scale_x_discrete( labels = c( 'No', 'Yes' ),
name = 'At least a member of household read magazines in 1979' ) +
labs( title = 'Magazine reader in the household vs Esteem PC1' )
ggNewsp <-
ggplot(esteem) +
geom_boxplot( aes( x = factor( Inewspaper ), y = Esteem ) ) +
scale_x_discrete( labels = c( 'No', 'Yes' ),
name = 'At least a member of household read newspapers in 1979') +
labs( title = 'Newspaper reader in the household vs Esteem PC1' )
ggLib <-
ggplot(esteem) +
geom_boxplot( aes( x = factor( Ilibrary ), y = Esteem ) ) +
scale_x_discrete( labels = c( 'No', 'yes' ),
name = 'At least a member of household with a library card in 1979' ) +
labs( title = 'Library card holder in the household vs Esteem PC1' )
ggParentsEd <- esteem %>%
select(MotherEd, FatherEd)%>%
pivot_longer(
cols = c(MotherEd, FatherEd),
names_to = 'variable',values_to = 'value') %>%
mutate(variable = str_to_title(variable)) %>%
ggplot() +
geom_bar(mapping=aes(y=value)) +
labs(y='',
x='Count') +
facet_wrap(~variable, scales = 'free',nrow = 3) +
theme_minimal()
ggFamIncome <-
ggplot( esteem, aes( x = logFamIncome, y = Esteem ) ) +
geom_point( ) +
geom_smooth( method = 'lm', se = F ) +
scale_x_continuous( breaks = seq( -10, 15, 1 ),
name = 'log(Family Income in 1979)' ) +
scale_y_continuous( breaks = seq( -5, 5, 1 ) ) +
labs( title = 'Family Income vs Esteem PC1' )
print(ggMagaz)
print(ggNewsp)
print(ggLib)
print(ggParentsEd)
print(ggFamIncome)
Based on the EDA, I decide to run four regression models with esteem as the response variable. The model specifications are provided below:
We decided not to include the Family Income variable due to its likely collinearity with parental education. Further, based on the scatterplot of self-esteem and BMI, we restrict BMI to values below 50 in order to remove a few outliers.
b) Run a few regression models between PC1 of all the esteem scores and suitable variables listed in a). Find a final best model with your own criterion.
# Model 1: personal characteristics (sex + BMI)
model1 <-
lm( data = esteem %>% filter( BMI < 50 ),
Esteem ~
Male + BMI )
# Model 2: personal characteristics + intelligence and income (+ Education05 + ASVAB + logIncome)
model2 <-
lm( data = esteem %>% filter( BMI < 50 ),
Esteem ~
Male + BMI +
Education05 + ASVAB)
model3<-
lm( data = esteem %>% filter( BMI < 50 ),
Esteem ~
Male + BMI +
Education05 + ASVAB+ Income87)
# Model 4: personal characteristics + intelligence + family characteristics (+ Imagazine + Inewspaper + Ilibrary + MotherEd + FatherEd)
model4 <-
lm(data=esteem %>% filter( BMI < 50 ),
Esteem ~
Male + BMI +
Education05 + ASVAB +
Imagazine + Inewspaper + Ilibrary + FatherEd + MotherEd+ Income87)
stargazer::stargazer(
model1, model2, model3, model4,
#type = output_format,
# ci=TRUE, ci.level = .95,
  keep.stat = c( "n", "rsq","sigma2", "ser" ) )
- How did you land this model? Run a model diagnosis to see if the linear model assumptions are reasonably met.
par(mfrow=c(1,2))
plot(model1, 1)
plot(model1, 2)
par(mfrow=c(1,2))
plot(model2, 1)
plot(model2, 2)
par(mfrow=c(1,2))
plot(model3, 1)
plot(model3, 2)
par(mfrow=c(1,2))
plot(model4, 1)
plot(model4, 2)
par( mfrow = c( 1, 2 ),
     mar = c( 5, 2, 4, 2 ),
     mgp = c( 3, 0.5, 0 ) ) # plot(model4) produces several plots
plot( model4, 1, pch = 16) # residual plot
abline( h = 0, col = "blue", lwd = 2 )
plot( model4, 2 ) # qqplot
- Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem.
Looking at the regression table in part (b) we see several things:
In model 1, which includes only BMI and gender, we see that being male and having a higher BMI are significantly positively associated with higher reported self-esteem.
In model 2, including education and the intelligence score makes BMI irrelevant, while being male, education, and intelligence are significantly positively associated with self-esteem.
In model 3, we also include the respondents' income in 1987 and find that being male no longer has a significant association.
In model 4, we additionally include variables for family characteristics. We see that an individual's education, intelligence, income, and household access to newspapers are significantly positively associated with their self-esteem evaluation. In particular, each extra year of schooling increases the PC1 score by 0.067 units, and a one-point increase in the ASVAB score increases the Esteem PC1 score by 0.073 units. Further, controlling for all other variables, individuals from households with access to newspapers have a PC1 score that is higher by 0.15 units relative to those from households without. The increase in the self-esteem PC1 associated with income is very small in magnitude.
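As a criterion for landing on a final model, nested specifications like the four above can be compared formally with an F-test and AIC. A minimal sketch on simulated data (variable names and coefficients below are illustrative, not from the NLSY regressions):

```r
# Sketch: compare nested linear models with an F-test and AIC
set.seed(4)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)   # x3 is irrelevant by construction

m_small <- lm(y ~ x1 + x2)
m_big   <- lm(y ~ x1 + x2 + x3)

anova(m_small, m_big)   # F-test for the added term
AIC(m_small, m_big)     # lower AIC is preferred
```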
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to the public from the Genomic Data Commons Data Portal (GDC).
In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier.
We will try to use mRNA expression data alone without the labels to classify 4 sub-types. Classification without labels or prediction without outcomes is called unsupervised learning. We will use K-means and spectrum clustering to cluster the mRNA data and see whether the sub-type can be separated through mRNA data.
We first read the data using data.table::fread() which
is a faster way to read in big data than read.csv().
brca <- fread(here('data','brca_subtype.csv'))
# get the sub-type information
brca_subtype <- brca$BRCA_Subtype_PAM50
brca <- brca[,-1]
num_gene <- ncol(brca)
dim(brca)
## [1]  1160 19947
Summary and transformation
Answer: 1160 patients.
b) Randomly pick 5 genes and plot the histogram by each sub-type.
# randomly select 5 gene
set.seed(10)
sample_idx <- sample(num_gene, 5)
brca %>%
select(all_of(sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # reshape to long format for faceting
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
c) Remove genes with zero counts and no variability. Then apply a logarithmic transform.
# remove genes with 0 counts
sel_cols <- which(colSums(abs(brca)) != 0)
brca_sub <- brca[, sel_cols, with=F]
dim(brca_sub)
## [1]  1160 19669
# log to make it spread out
brca_sub <- log2(as.matrix(brca_sub + 1e-10))
Apply kmeans and compare brca_subtype and the cluster labels.
system.time({brca_sub_kmeans <- kmeans(x = brca_sub, 4)})
##   user  system elapsed
## 11.288   0.187  11.479
# save the results as RDS
saveRDS(brca_sub_kmeans, "brca_kmeans.RDS")
brca_sub_kmeans <- readRDS("brca_kmeans.RDS")
# discrepancy table
table(brca_subtype, brca_sub_kmeans$cluster)
##
## brca_subtype 1 2 3 4
## Basal 1 17 3 187
## Her2 39 9 26 17
## LumA 392 82 154 0
## LumB 98 22 111 2
Spectrum clustering: to scale or not to scale?
irlba::irlba().
Answer: 4 PCs should be used based on the elbow rule, because further increasing the number of PCs will not give us much more information (the PVE improvement is lower than 0.03).
pca_ret <- prcomp(brca_sub, center = T, scale. = T)
# save the result
# only save a few leading PCs
pca_ret$rotation <- pca_ret$rotation[, 1:20]
pca_ret$x <- pca_ret$x[, 1:20]
saveRDS(pca_ret, here('data',"brca_pca.RDS"))
pca_ret <- readRDS(here('data',"brca_pca.RDS"))
# Plot scree plot of PVE
pve <- summary(pca_ret)$importance[2, 1:10]
plot(pve, type = "b", pch = 19, frame = FALSE)
b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for the clustering process? Why? (Hint: to put plots side by side, use `gridExtra::grid.arrange()` or `ggpubr::ggarrange()` or `egg::ggarrange()` for ggplots; use `fig.show="hold"` as chunk option for base plots)
Answer: clustering based on the unscaled data is preferred. Rescaling can discard information: the differences in variance across genes are themselves informative, and forcing every gene to unit variance gives low-signal genes the same weight as highly expressed ones. The unscaled data preserve this original information and help to produce more distinguishable clusters.
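The effect of scaling on which directions PCA picks can be seen on a tiny simulated example (toy data, not the BRCA matrix):

```r
# Toy illustration: when one variable has a much larger variance, unscaled
# PCA lets it dominate PC1, while scaling puts all variables on equal footing.
set.seed(1)
n <- 200
X <- cbind(high_var = rnorm(n, sd = 10),  # dominant-variance feature
           low_var  = rnorm(n, sd = 1))

pca_unscaled <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled   <- prcomp(X, center = TRUE, scale. = TRUE)

# Unscaled: PC1 is essentially the high-variance column (loading near 1)
abs(pca_unscaled$rotation["high_var", "PC1"])
# Scaled: the PC1 loadings are balanced across the two columns
abs(pca_scaled$rotation[, "PC1"])
```

Whether this dominance is a feature or a bug depends on whether the large variances carry signal, which is exactly the judgment call made in the answer above.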
# center and scale the data
brca_sub_scaled_centered <- scale(as.matrix(brca_sub), center = T, scale = T)
svd_ret <- irlba::irlba(brca_sub_scaled_centered, nv = 2)
pc_score <- (svd_ret$u[, 1:2])*(svd_ret$d[1:2]) # from svd
# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)
p1 <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
values = c(4,8, 10, 16)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2")
brca_sub_unscaled_centered <- scale(as.matrix(brca_sub), center = T, scale = F)
svd_ret <- irlba::irlba(brca_sub_unscaled_centered, nv = 2)
pc_score <- (svd_ret$u[, 1:2])*(svd_ret$d[1:2]) # from svd
# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)
p2 <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
values = c(4,8, 10, 16)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2")
ggpubr::ggarrange(p1, p2, ncol = 2, nrow = 1)
Spectrum clustering: center but do not scale the data
Based on the elbow rule, we would choose 5 as the reasonable number of clusters.
svd_ret <- irlba::irlba(brca_sub_unscaled_centered, nv = 4)
pc_score <- (svd_ret$u[, 1:4])*(svd_ret$d[1:4]) # from svd
set.seed(10)
# function to compute total within-cluster sum of square
wss <- function(df, k) {
kmeans(df, k, nstart = 10)$tot.withinss
}
k.values <- 2:15
wss_values <- sapply(k.values, function(k) wss(brca_sub_unscaled_centered, k))
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: Plot scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids with black dots. Summarize how good is clustering results compared to the real sub-type.
Answer: The optimal cluster number does not perform well in identifying the real sub-types. Though clusters 1 and 3 pretty much identify "Basal", there is a mix of LumA and LumB in clusters 4 and 5.
# apply kmean
kmean_ret <- kmeans(x = pc_score, 5)
# Extract the centroids
centroids <- data.frame(x = kmean_ret$centers[,1], y = kmean_ret$centers[,2])
p <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5"),
values = c(4,8, 10, 17, 20)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2")
p + geom_point(data = centroids, aes(x = x, y = y), color = "black", size = 3, shape = 16)
c) Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?
The kmeans clustering using 4 PCs separates the four types almost as well as the kmeans clustering on the original data. The betweenss value also increases with the clustering based on the 4 PCs, suggesting that the four clusters are even more separated from one another. Pre-processing the data with PCA before cluster analysis helps reduce the dimension of the data, making it easier to analyze. As plotted in Q3 above, based on the Proportion of Variance Explained (PVE), 4 PCs already capture most of the variance in the original data. This signals that the 4 PCs capture the most important features of the data, which helps with the efficiency of the cluster analysis.
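The dimension-reduction argument can be illustrated on simulated data (a hypothetical sketch, not the BRCA matrix): k-means on a few leading PCs can recover well-separated clusters while working in a much lower dimension than the raw data.

```r
# Simulate 3 well-separated clusters in a 50-dimensional space,
# then cluster on the top 2 principal-component scores only.
set.seed(1)
p <- 50                                   # ambient dimension
centers <- rbind(rep(0, p),
                 c(20, rep(0, p - 1)),
                 c(0, 20, rep(0, p - 2)))
truth <- rep(1:3, each = 50)              # true group labels
X <- centers[truth, ] + matrix(rnorm(150 * p), 150, p)

pcs <- prcomp(X, center = TRUE, scale. = FALSE)$x[, 1:2]
km  <- kmeans(pcs, centers = 3, nstart = 10)

# With this separation, each k-means cluster maps to one true group
table(truth, km$cluster)
```

The between-cluster structure lives in just two directions here, so the 2-PC projection loses almost nothing that matters for clustering, mirroring the scree-plot argument above.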
svd_ret <- irlba::irlba(brca_sub_unscaled_centered, nv = 4)
pc_score <- (svd_ret$u[, 1:4])*(svd_ret$d[1:4]) # from svd
kmean_ret4 <- kmeans(x = pc_score, 4)
## based on the original data
table(brca_subtype, brca_sub_kmeans$cluster)
##
## brca_subtype 1 2 3 4
## Basal 1 17 3 187
## Her2 39 9 26 17
## LumA 392 82 154 0
## LumB 98 22 111 2
## based on the 4 PCs
table(brca_subtype, kmean_ret4$cluster)
##
## brca_subtype 1 2 3 4
## Basal 2 189 1 16
## Her2 28 26 28 9
## LumA 401 0 153 74
## LumB 94 1 116 22
# compare the between-cluster sums of squares
brca_sub_kmeans$betweenss  # original data
## [1] 50176179
kmean_ret4$betweenss       # 4 PCs
## [1] 67530386
d) Now we have an x patient with breast cancer but with unknown sub-type. We have this patient's mRNA sequencing data. Project this x patient onto the space of PC1 and PC2. (Hint: remember we removed some genes with no counts or no variability, took the log, and centered.) Plot this patient in the plot in iv) with a black dot. Calculate the Euclidean distance between this patient and each centroid of the clusters. Can you tell which sub-type this patient might have?
The Euclidean distances between this patient and each cluster centroid are computed below. The patient is closest to the centroid of the cluster dominated by Basal samples, so this patient most likely has the Basal sub-type of breast cancer.
x_patient <- fread(here('data','brca_x_patient.csv'))
new_df <- rbind(brca, x_patient)
sel_cols2 <- which(colSums(abs(new_df)) != 0)
x_patient_sub <- new_df[, sel_cols2, with=F]
dim(x_patient_sub)
## [1]  1161 19669
# log to make it spread out
x_patient_sub <- log2(as.matrix(x_patient_sub+1e-10))
patient_sub_centered <- scale(as.matrix(x_patient_sub), center = T, scale = F)
svd_ret2 <- irlba::irlba(patient_sub_centered, nv = 2)
pc_score <- patient_sub_centered %*% svd_ret2$v[, 1:2]
x_patient_pc <-pc_score[1161, ]
x_patient_pc <- data.frame(x = x_patient_pc[1], y = x_patient_pc[2])
# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)
p <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
values = c(4,8, 10, 17)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2")
p + geom_point(data = x_patient_pc, aes(x = x, y = y), color = "black", size = 3, shape = 16)
# Calculate the Euclidean distances between the patient and each cluster centroid
distance1 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[1,])^2))
distance2 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[2,])^2))
distance3 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[3,])^2))
distance4 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[4,])^2))
# View the distances
distance1; distance2; distance3; distance4
## [1] 553
## [1] 500
## [1] 141
## [1] 577
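The four near-identical calls above can be collapsed into a single vectorized helper. A minimal sketch (the `dist_to_centroids` name and the toy centroids are made up for illustration):

```r
# Distances from a single point to every centroid row at once.
dist_to_centroids <- function(point, centers) {
  # sweep() subtracts the point coordinate-wise from each centroid row
  sqrt(rowSums(sweep(centers, 2, point)^2))
}

# Tiny sanity check with made-up coordinates (not the fitted centroids)
ctrs <- rbind(c(0, 0), c(3, 4))
dist_to_centroids(c(0, 0), ctrs)   # 0 and 5
```

In the code above, the same result could presumably be obtained with `dist_to_centroids(unlist(x_patient_pc), kmean_ret$centers)`, with the centroid order matching the cluster labels.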
This question utilizes the Auto dataset from ISLR. The
original dataset contains 408 observations about cars. It is similar to
the CARS dataset that we use in our lectures. To get the data, first
install the package ISLR. The Auto dataset should be loaded
automatically. We’ll use this dataset to practice the methods learned so
far. Original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg
Get familiar with this dataset first. Tip: you can use the command
?ISLR::Auto to view a description of the dataset.
Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.
What effect does time have on MPG?
Start with a simple regression of mpg
vs. year and report R’s summary output. Is
year a significant variable at the .05 level? State what
effect year has on mpg, if any, according to
this model.
Add horsepower on top of the variable
year to your linear model. Is year still a
significant variable at the .05 level? Give a precise interpretation of
the year’s effect found here.
The two 95% CI’s for the coefficient of year differ among (i) and (ii). How would you explain the difference to a non-statistician?
Create a model with interaction by fitting
lm(mpg ~ year * horsepower). Is the interaction effect
significant at .05 level? Explain the year effect (if any).
Remember that the same variable can play different roles! Take a
quick look at the variable cylinders, and try to use this
variable in the following analyses wisely. We all agree that a larger
number of cylinders will lower mpg. However, we can interpret
cylinders as either a continuous (numeric) variable or a
categorical variable.
Fit a model that treats cylinders as a
continuous/numeric variable. Is cylinders significant at
the 0.01 level? What effect does cylinders play in this
model?
Fit a model that treats cylinders as a
categorical/factor. Is cylinders significant at the .01
level? What is the effect of cylinders in this model?
Describe the cylinders effect over
mpg.
What are the fundamental differences between treating
cylinders as a continuous and categorical variable in your
models?
Can you test the null hypothesis: fit0: mpg is
linear in cylinders vs. fit1: mpg relates to
cylinders as a categorical variable at .01 level?
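For the nested-model test in the last question, one possible approach (sketched on the built-in mtcars data as a stand-in, since the Auto fits are not shown here) is an F-test via anova(): the numeric-cylinders model is nested in the factor-cylinders model.

```r
# Nested-model F-test: is the categorical coding of cylinders a
# significant improvement over the linear (numeric) coding?
fit0 <- lm(mpg ~ cyl, data = mtcars)          # cylinders as numeric
fit1 <- lm(mpg ~ factor(cyl), data = mtcars)  # cylinders as categorical
anova(fit0, fit1)   # a small p-value would favor the categorical model
```

The same two-fit-plus-anova pattern applies directly to ISLR::Auto with `cylinders` in place of `cyl`.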
Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.
Describe the final model. Include diagnostic plots with particular focus on the model residuals and diagnoses.
Summarize the effects found.
Predict the mpg of the following car: A red car
built in the US in 1983 that is 180 inches long, has eight cylinders,
displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of
260. Also give a 95% CI for your prediction.
This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.
Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).
We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:
# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
Create a corresponding output vector for \(\mathbf{y}\) according to the equation
given above. Use set.seed(1). Then, create a scatterplot
with \((x_i, y_i)\) pairs. Base R
plotting is acceptable, but if you can, please attempt to use
ggplot2 to create the plot. Make sure to have clear labels
and sensible titles on your plots.
Here we generate the data
# Set seed to ensure replicabilty
set.seed(42)
# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
# Create y based on the model equation
y <- 1 + 1.2*x + rnorm(40, 0, 2)
Now we can visualize it:
ggplot() +
geom_point(mapping=aes(x=x,y=y)) +
labs(title = TeX('Plotting $x$ Values Against $y$ Values Generated from the Model $y = 1 + 1.2x + \\epsilon$')) +
theme_minimal()
The graph reveals that because x only ranges from 0 to 1, the highly dispersed errors (recall that \(\varepsilon\) has a standard deviation of 2) heavily obscure the relationship between x and y. This does not necessarily pose a problem for estimating the true coefficients of the model, aside from the fact that the estimates will be noisy (i.e. large standard errors).
Estimate the coefficients of the model using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look to be good?
Here is how we estimate the coefficients:
# Fit linear model
modelQ412 <- lm(y ~ 1 + x)
# Display model summary
summary(modelQ412)
##
## Call:
## lm(formula = y ~ 1 + x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.397 -0.909 -0.297 1.766 4.163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.19 0.73 3.01 0.0047 **
## x -1.35 1.26 -1.07 0.2905
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.35 on 38 degrees of freedom
## Multiple R-squared: 0.0294, Adjusted R-squared: 0.00381
## F-statistic: 1.15 on 1 and 38 DF, p-value: 0.29
The estimates are not very good. The true intercept is 1, but the model estimates it as 2.19. Similarly, the true slope for x is 1.2 and the model estimates it as -1.35.
The estimated standard error of the residuals is 2.35, not too far from the true value of 2.
Because the sample size is rather small (\(N=40\)) we should use the Student t critical values rather than the z values from the Normal distribution.
# Get critical value
critVal <- abs(qt(0.025, 38)) # Recall that the Student t distribution is symmetric
We can now compute the 95% confidence intervals as:
# Upper bounds
coef(modelQ412) + critVal*sqrt(diag(vcov(modelQ412)))
## (Intercept)           x
## 3.67 1.20
# Lower bounds
coef(modelQ412) - critVal*sqrt(diag(vcov(modelQ412)))
## (Intercept)           x
## 0.716 -3.892
Which is precisely the result we would have obtained using the
confint function.
confint(modelQ412)
##               2.5 % 97.5 %
## (Intercept) 0.716 3.67
## x -3.892 1.20
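Incidentally, the earlier claim that t critical values should be used instead of Normal ones at this sample size can be checked directly:

```r
# With 38 degrees of freedom, the t critical value is slightly larger than
# the Normal one, so t-based confidence intervals are a bit wider.
qt(0.975, df = 38)   # about 2.02
qnorm(0.975)         # about 1.96
```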
Here is the required plot.
We can see that the estimated line is rather far from the real one.
Here is the residuals plot
There are no clear signs of violation of homoskedasticity (because the errors are indeed homoskedastic).
Here is the Normal QQ plot.
qqnorm(resid(modelQ412))
qqline(resid(modelQ412))
Normally distributed residuals should fall along the line. There is some evidence that there could be a deviation from Normality, especially in the left tail (though we know this is not the case).
We know that at the population level the assumptions hold, but in our specific sample we could be concerned about non-Normality of the errors. Homoskedasticity seems satisfied even in this sample.
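The visual QQ check could be supplemented with a formal Normality test. A sketch on this fit's residuals (the data are regenerated here so the snippet is self-contained; in the document this would be resid(modelQ412)):

```r
# Shapiro-Wilk test of Normality applied to the regression residuals.
set.seed(42)
x <- seq(0, 1, length = 40)
y <- 1 + 1.2 * x + rnorm(40, 0, 2)
fit <- lm(y ~ x)
shapiro.test(resid(fit))   # a large p-value is consistent with Normal errors
```

As with the QQ plot, a small sample limits the power of such a test, so it complements rather than replaces the visual check.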
This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.
Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.
# Initializing variables. Note b1, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40)
n_sim <- 100 # number of simulations
b1 <- 0 # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
upper_ci <- 0 # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0 # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38) # Food for thought: why 38 instead of 40? What is t_star?
# Perform the simulation
for (i in 1:n_sim){
y <- 1 + 1.2 * x + rnorm(40, sd = 2)
lse <- lm(y ~ x)
lse_output <- summary(lse)$coefficients
se <- lse_output[2, 2]
b1[i] <- lse_output[2, 1]
upper_ci[i] <- b1[i] + t_star * se
lower_ci[i] <- b1[i] - t_star * se
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
# remove unnecessary variables from our workspace (keep x for reuse below)
rm(se, b1, upper_ci, lower_ci, n_sim, t_star, lse, lse_output)
We will try a more compact approach with the code block below. We
could have been less verbose but we preferred a more self-explanatory
approach. We make use of the wonderful purrr and
broom libraries, part of the tidyverse.
generate_y <- function(x) {
y <- 1 + 1.2*x + rnorm(40,0,2)
}
fit_model <- function(x,y) {
fittedModel <- lm(y ~ 1 + x)
}
extract_coefficients <- function(fittedModel,thisSimulation) {
fittedModel %>%
broom::tidy() %>%
mutate(simNum = thisSimulation)
}
clean_coef_table <- function(x,thisSimulation) {
y <- generate_y(x)
fittedModel <- fit_model(x,y)
cleanCoefs <- extract_coefficients(fittedModel,thisSimulation)
}
cleanedCoefficients <- map_dfr(1:100, ~ clean_coef_table(x, .x))
Plot the distribution of the estimated slopes (results$b1). Does the sampling distribution agree with theory?
Here is a density plot of the centered estimated slope for x (\(\beta_1\)) compared with the theoretical distribution, a Student t with 38 degrees of freedom. The two distributions are barely distinguishable. Notice that we had to center the sampling distribution at 0 (by subtracting its mean).
Let us first build the confidence intervals and add them as new columns to our dataframe.
cleanedCoefficients <- cleanedCoefficients %>%
mutate(upperBound = estimate + critVal*std.error,
lowerBound = estimate - critVal*std.error)
Now let's visualize them:
Each horizontal line represents the 95% confidence interval for the slope in one of the samples. The solid vertical line shows the true slope. We can see that most of the confidence intervals contain the true slope. Let us make this statement more precise.
# We compute the proportion of intervals that contain the true value.
propContains <- cleanedCoefficients %>%
  filter(term == 'x') %>%
  summarise(propContains = mean(lowerBound <= 1.2 & upperBound >= 1.2)) %>%
  pull(propContains)
We find that 0.96 of the intervals contain the true value, very close to the expected proportion (0.95).